Main Analysis (Exploratory Data Analysis)
Our exploratory data analysis of the Yelp Toronto restaurant dataset can be divided into two sections: 1) analysis of geographical distribution of restaurants and 2) analysis of review sentiment and its correlation with rating.
Geographical Distribution of Restaurants
We started our analysis of Toronto restaurants by filtering out all restaurants except those categorized under the top 20 most common cuisines. An initial visualization of the geographical distribution of these restaurants showed that restaurants are most concentrated in the downtown area and become more sparsely distributed farther from it.
cuisines_toronto_dedup <- cuisines_toronto[-13] %>% distinct()
qmplot(longitude, latitude, data = cuisines_toronto_dedup, maptype = "toner-lite", color = I("red"), size=I(0.5), alpha=I(.5)) + ggtitle("Restaurants in Toronto")
To fully understand the geographical distribution of restaurants around Toronto and the various factors that might affect it, we explored the other attributes of the restaurants available to us from the dataset:
1) Geographic Distribution of Restaurants in Toronto by Cuisine
Most cuisines follow the same general pattern of being concentrated in the downtown area and dispersed outside of it. The Chinese restaurant distribution, however, is notable in that it has two separate areas of concentration: one downtown and another a considerable distance to the northeast.
qmplot(longitude, latitude, data = cuisines_toronto, maptype = "toner-lite", color=I("red"), size=I(0.05), alpha=I(0.5)) + ggtitle("Restaurants in Toronto by Cuisine") + facet_wrap(~category)
2) Geographic Distribution of Restaurants in Toronto by Rating
For the purposes of mapping, “Bad” restaurants had a rating of 3 stars or below, and “Good” restaurants had a rating above 3 stars; “Very Bad” restaurants had 2 stars or below, and “Very Good” restaurants had more than 4 stars. There is no noticeable pattern in the distribution of restaurants by rating that departs from the general pattern of restaurant distribution.
cuisines_toronto_rating_dedup <- cuisines_toronto_dedup %>%
mutate(rating = cut(stars, breaks=c(0,2,3,4,5), labels=c("Very Bad","Bad","Good","Very Good")))
qmplot(longitude, latitude, data = cuisines_toronto_rating_dedup, maptype = "toner-lite", color = I("red"), size=I(0.5), alpha=I(.5)) + ggtitle("Restaurants in Toronto by Rating") + facet_wrap(~rating)
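As a side note on the bucketing above: cut() produces right-closed intervals by default, so each boundary value falls into the lower category. A toy illustration (values invented for illustration only):

```r
# cut() with default right = TRUE yields the intervals (0,2], (2,3], (3,4], (4,5]
stars <- c(2, 2.5, 3, 3.5, 4, 4.5)
rating <- cut(stars, breaks = c(0, 2, 3, 4, 5),
              labels = c("Very Bad", "Bad", "Good", "Very Good"))
data.frame(stars, rating)
# a 3-star value lands in "Bad" and a 4-star value in "Good",
# because each sits at the closed right end of its interval
```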
3) Geographic Distribution of Restaurants with “Good” and “Very Good” Rating by Cuisine
There was no discernible departure from the general restaurant distribution by cuisine when only “Good” and “Very Good” restaurants were visualized.
cuisines_toronto_rating <- cuisines_toronto %>%
mutate(rating = cut(stars, breaks=c(0,2,3,4,5), labels=c("Very Bad","Bad","Good","Very Good")))
cuisines_toronto_good <- cuisines_toronto_rating %>%
filter(rating %in% c("Good","Very Good"))
qmplot(longitude, latitude, data = cuisines_toronto_good, maptype = "toner-lite", color=I("red"), size=I(0.05), alpha=I(0.5)) + ggtitle('"Good" and "Very Good" Restaurants in Toronto by Cuisine') + facet_wrap(~category)
4) Geographic Distribution of Restaurants in Toronto by Number of Check-ins
To visualize this distribution, we joined check-in data with the restaurant data and removed all rows with NA values in the check-in column. For the purposes of these mappings, restaurants with 44 check-ins or fewer were considered “Low” and those with more were considered “High”; restaurants with 13 or fewer check-ins were “Very Low” and those with more than 137 were “Very High”. The restaurant distributions by number of check-ins showed the same general pattern as all restaurants.
# reading in check-in data
check_in <- read_csv("~/EDAV_project/Data/restaurants checkins Toronto.csv")
check_in <- check_in %>%
group_by(business_id) %>%
summarise(n_checkins = sum(count, na.rm=TRUE))
colnames(check_in)[1] <- "id"
cuisines_toronto_check_ins_dedup <- plyr::join(cuisines_toronto_rating_dedup, check_in, by = "id")
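plyr::join defaults to a left join, so restaurants without a check-in record come through with NA in n_checkins; this is why we check for missing values next. A toy illustration (hypothetical ids):

```r
# left join: ids absent from the right-hand table yield NA, and row order is preserved
restaurants_toy <- data.frame(id = c("a", "b", "c"))
check_in_toy <- data.frame(id = c("a", "c"), n_checkins = c(10, 5))
joined <- plyr::join(restaurants_toy, check_in_toy, by = "id")
joined
#   id n_checkins
# 1  a         10
# 2  b         NA
# 3  c          5
```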
# checking for missing values
skim(cuisines_toronto_check_ins_dedup$n_checkins)
## Skim summary statistics
##
## Variable type: integer
## variable missing complete n mean
## cuisines_toronto_check_ins_dedup$n_checkins 471 2742 3213 128.41
## sd p0 p25 p50 p75 p100 hist
## 258.14 1 13.25 44 137 5755 ▇▁▁▁▁▁▁▁
# remove restaurants where n_checkins is na
cuisines_toronto_check_ins_dedup <- cuisines_toronto_check_ins_dedup[!is.na(cuisines_toronto_check_ins_dedup$n_checkins),]
# using summary provided by skim to choose cutoff points for checkin categories
skim(cuisines_toronto_check_ins_dedup$n_checkins)
## Skim summary statistics
##
## Variable type: integer
## variable missing complete n mean
## cuisines_toronto_check_ins_dedup$n_checkins 0 2742 2742 128.41
## sd p0 p25 p50 p75 p100 hist
## 258.14 1 13.25 44 137 5755 ▇▁▁▁▁▁▁▁
cuisines_toronto_check_ins_dedup <- cuisines_toronto_check_ins_dedup %>% mutate(checkin_cat = cut(n_checkins, breaks=c(-Inf,13,44,137,Inf), labels=c("Very Low","Low","High","Very High")))
qmplot(longitude, latitude, data = cuisines_toronto_check_ins_dedup, maptype = "toner-lite", color = I("red"), size=I(0.5), alpha=I(.5)) + ggtitle("Restaurants in Toronto by Number of Check-ins") + facet_wrap(~checkin_cat)
5) Geographic Distribution of Restaurants with “High” and “Very High” Number of Check-ins by Cuisine
Here, again, there was no discernible departure from the general restaurant distribution by cuisine when only restaurants with a “High” or “Very High” number of check-ins were visualized.
# joining checkin data with cuisines_toronto_rating
cuisines_toronto_check_ins <- plyr::join(cuisines_toronto_rating, check_in, by = "id")
# checking for missing values
skim(cuisines_toronto_check_ins$n_checkins)
## Skim summary statistics
##
## Variable type: integer
## variable missing complete n mean sd
## cuisines_toronto_check_ins$n_checkins 498 3792 4290 155.11 339.98
## p0 p25 p50 p75 p100 hist
## 1 14 50 159 5755 ▇▁▁▁▁▁▁▁
# remove restaurants where n_checkins is na
cuisines_toronto_check_ins <- cuisines_toronto_check_ins[!is.na(cuisines_toronto_check_ins$n_checkins),]
cuisines_toronto_check_ins <- cuisines_toronto_check_ins %>% mutate(checkin_cat = cut(n_checkins, breaks=c(-Inf,5,30,110,Inf), labels=c("Very Low","Low","High","Very High")))
cuisines_toronto_high <- cuisines_toronto_check_ins %>%
filter(checkin_cat %in% c("High","Very High"))
qmplot(longitude, latitude, data = cuisines_toronto_high, maptype = "toner-lite", color=I("red"), size=I(0.05), alpha=I(0.5)) + ggtitle('Restaurants in Toronto with "High" and "Very High" Number of Check-ins by Cuisine') + facet_wrap(~category)
6) Cuisine Diversity by Area
A final geographic analysis we wanted to perform was to understand which areas of Toronto have the highest culinary diversity. As mentioned in the “Data Description” section, we resorted to creating our own areas using Python’s geohash library. We then applied a methodology similar to the one used in our effort to identify which city has the highest cuisine diversity: we calculated the Gini index for each area and subsequently investigated which area had the highest non-weighted index. The result is shown below.
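For illustration, the geohash encoding we used from Python can be sketched in a few lines of base R: a geohash interleaves successive longitude/latitude bisections into a base-32 string, so nearby points share the same hash and a fixed precision partitions the city into cells. (The default precision of 5 characters below is arbitrary; it is not necessarily the precision we used for our areas.)

```r
# minimal geohash encoder, for illustration only
gh_encode <- function(lat, lon, precision = 5) {
  base32 <- strsplit("0123456789bcdefghjkmnpqrstuvwxyz", "")[[1]]
  lat_rng <- c(-90, 90); lon_rng <- c(-180, 180)
  bits <- integer(0)
  even <- TRUE  # geohash starts by bisecting longitude
  while (length(bits) < precision * 5) {
    if (even) {
      mid <- mean(lon_rng)
      if (lon >= mid) { bits <- c(bits, 1L); lon_rng[1] <- mid }
      else            { bits <- c(bits, 0L); lon_rng[2] <- mid }
    } else {
      mid <- mean(lat_rng)
      if (lat >= mid) { bits <- c(bits, 1L); lat_rng[1] <- mid }
      else            { bits <- c(bits, 0L); lat_rng[2] <- mid }
    }
    even <- !even
  }
  # pack each run of 5 bits into one base-32 character
  idx <- vapply(seq(1, length(bits), by = 5),
                function(i) sum(bits[i:(i + 4)] * c(16L, 8L, 4L, 2L, 1L)),
                integer(1))
  paste(base32[idx + 1], collapse = "")
}
gh_encode(43.6532, -79.3832)  # a downtown Toronto coordinate
```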
# Get review data with geocoded area attached
reviews_area <- read_csv("~/EDAV_Project/Data/Toronto_revs_area.csv")
# Get cuisine frequencies within each area
revs_freq <- reviews_area %>%
group_by(geohash, category) %>%
summarise(count = n()) %>%
mutate(perc = round(count/sum(count),4))
# Gini-Simpson diversity: 1 minus the sum of squared cuisine shares
gini <- function(x, na.rm = TRUE) {
index <- 1 - sum(x^2, na.rm = na.rm)
return(index)
}
# Get the gini index for each area
div_area <- revs_freq %>%
group_by(geohash) %>%
summarise(gini = gini(perc))
coords <- reviews_area %>%
group_by(geohash) %>%
summarise(median_lat = median(latitude), median_lon = median(longitude))
# attach median coordinates to each area (kept separate so the review-level data is not overwritten)
div_area_coords <- plyr::join(div_area, coords, by = "geohash")
ggplot(div_area, aes(reorder(geohash, gini), gini)) +
geom_col(color = "darkblue", fill = "darkblue") +
xlab("Area") +
ylab("Cuisine Diversity Score") +
coord_flip() +
ggtitle("Culinary Diversity by Area")
#ggsave(filename = "fig17.2.png", width = 4, height = 4, dpi = 200)
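As a quick sanity check of this diversity score on toy proportions (not from the data): an area dominated by a single cuisine scores low, while an even split across cuisines scores high.

```r
gini <- function(x) 1 - sum(x^2)  # Gini-Simpson: 1 minus the sum of squared shares
gini(c(0.85, 0.05, 0.05, 0.05))   # one dominant cuisine: 0.27
gini(rep(0.25, 4))                # four equally common cuisines: 0.75
```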
We can observe that most of the ten areas are roughly equally diverse, with Gini indices ranging from 90% to 97%. One area, the Garden District, has a significantly lower diversity score (80%) than the remaining areas. Further analysis would be required to identify the reasons for this pattern.
Sentiment, rating, and check-ins
Our next step was a deeper analysis of the popularity of restaurants in Toronto. We attempted to gauge the popularity of restaurants and cuisines using ratings, check-ins, and the sentiment scores of reviews. We obtained a sentiment score for each review (a value between -1 and 1) using Python’s TextBlob library, and we used these values to compute the average sentiment score for each restaurant.
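The TextBlob scoring itself happened in Python and is not shown in this report. To give a flavor of what a polarity score between -1 and 1 means, here is a deliberately tiny lexicon-based sketch in R; the word lists are invented for illustration and are far cruder than TextBlob's model:

```r
# toy polarity: (positive hits - negative hits) / word count, bounded in [-1, 1]
pos_words <- c("great", "excellent", "awesome", "delicious")
neg_words <- c("terrible", "worst", "disappointed")
score_review <- function(text) {
  w <- unlist(strsplit(tolower(text), "[^a-z]+"))
  w <- w[w != ""]
  (sum(w %in% pos_words) - sum(w %in% neg_words)) / max(length(w), 1)
}
score_review("great food, awesome service")        # 2/4 = 0.5
score_review("worst experience, so disappointed")  # -2/4 = -0.5
```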
Relationship between rating, number of check-ins and average sentiment
First, we visualized the relationship between the average number of stars, the number of check-ins, and the average sentiment. The plot below shows a positive correlation between rating and average sentiment, while the number of check-ins tends to be low across all rating and sentiment levels.
# add sentiment and check-in data to restaurants dataset
# adding in check-in data
restaurants_full <- plyr::join(restaurants, check_in, by = "id")
# reading in sentiment data
sentiment_toronto <- read_csv("~/EDAV_project/Data/sentiment_reviews_Toronto.csv")
# rename column name with sentiment value
colnames(sentiment_toronto)[11] <- "sentiment"
# get average sentiment for each restaurant
mean_sentiment <- sentiment_toronto %>%
group_by(business_id) %>%
summarise(avg_sentiment = mean(sentiment))
# add sentiment information to restaurants dataset
colnames(mean_sentiment)[1] <- "id"
restaurants_full <- plyr::join(restaurants_full, mean_sentiment, by = "id")
restaurants_full_dedup <- restaurants_full[-13] %>% distinct()
# remove outlier in n_checkins
restaurants_full_dedup <- restaurants_full_dedup %>% subset(n_checkins<2500)
# remove na values in n_checkins and avg_sentiment
restaurants_full_dedup <- restaurants_full_dedup[!is.na(restaurants_full_dedup$n_checkins),]
restaurants_full_dedup <- restaurants_full_dedup[!is.na(restaurants_full_dedup$avg_sentiment),]
# select the stars, n_checkins, and avg_sentiment columns (by position)
parallel_data <- restaurants_full_dedup[,c(10,13,14)]
colnames(parallel_data)[1] <- "Average Stars"
colnames(parallel_data)[2] <- "Number of Check-ins"
colnames(parallel_data)[3] <- "Average Sentiment"
# parallel coordinate plot of rating, sentiment, and check-ins
parcoords(parallel_data,
rownames = F,
brushMode = "1d-axes",
reorderable = T,
color = list(colorBy = "Average Stars",
colorScale = htmlwidgets::JS("d3.scale.category10()")))
In the following bar chart, we can clearly see a positive correlation between average sentiment scores and ratings:
# read in dataset of all reviews for Toronto businesses, text_1 column is the sentiment value
# sentiment value calculated using python library
toronto_reviews = read_csv("~/EDAV_project/Data/sentiment_reviews_Toronto.csv")
# trim columns of review data frame
toronto_reviews <- toronto_reviews[c("business_id", "date", "stars", "elite", "text_1")]
avg_sent <- toronto_reviews %>%
group_by(stars) %>%
summarize(mean = mean(text_1, na.rm = TRUE))
ggplot(avg_sent, aes(stars, mean)) +
geom_col(color = "darkblue", fill = "darkblue") +
xlab("Stars") +
ylab("Average Review Sentiment") +
ggtitle("Average Review Sentiment by Rating Category")
Having established a strong correlation between sentiment and rating, we wanted to see sentiment and rating distributions for different cuisines.
We visualized the distribution of ratings under each cuisine, restricting the analysis to the top 20 most frequent cuisines. In general, the rating distributions for all cuisines were left-skewed, with peaks at either 4 or 5 stars. French, Mediterranean, Middle Eastern, and Tapas Bars were among the most highly rated cuisines.
# read in restaurants in Toronto file as dataframe
restaurants <- read_csv("~/EDAV_project/restaurants in Toronto.csv")
# read in sentiment information
toronto = read_csv("~/EDAV_project/Data/sentiment_reviews_Toronto.csv")
# read in categories from text file
relevant_cats <- read.delim("~/EDAV_project/Cat_List_cuisines_only.txt", header = TRUE)
# subset restaurants dataset by categories in text file
restaurants <- restaurants[restaurants$category %in% relevant_cats$Category,]
# obtain mapping of business ids and categories
categories <- restaurants[c(1,13)]
# remove user id from toronto sentiment dataset and rename "business id" column as "id"
data <- toronto[-1]
colnames(data)[1] <- "id"
# join toronto sentiment dataset (without user id) with restaurants dataset which has rating and other info
revs_cat <- plyr::join(data, restaurants, by = "id")
# remove rows with missing values in category column
revs_cat <- subset(revs_cat, !is.na(revs_cat$category))
# refactor category and stars column
revs_cat$category <- as.factor(revs_cat$category)
revs_cat$stars <- as.factor(revs_cat$stars)
# count reviews per cuisine to find the top 20 most frequent cuisines
most_freq <- revs_cat %>%
group_by(category) %>%
summarise(cnt = n())
top_20 <- head(most_freq[order(-most_freq$cnt),], 20)$category
subset <- revs_cat[revs_cat$category %in% top_20,]
# counts by category and star level, and each star level's share within its category
temp <- subset %>%
group_by(category, stars) %>%
summarise(cnt = n()) %>%
mutate(perc = round(cnt/sum(cnt),4))
# plotting top 20 cuisine types
ggplot(data = temp, aes(stars, perc)) +
geom_col(fill = "darkblue", colour = "black") +
labs(y = "Proportion of Reviews", x = "Stars") +
coord_flip() +
facet_wrap(~category, scales = "free") +
ggtitle("Stars by Cuisine Type")
Below we visualized the distribution of review sentiment scores for each of the top 20 most frequent cuisines using boxplots. Review sentiment is generally centered between 0.2 and 0.3 and shows a similar spread across cuisines. The sentiment scores for French cuisine are more tightly distributed about their center (which sits at the higher end of the 0.2-0.3 interval) than those of other cuisines.
ggplot(subset, aes(category, text_1)) +
geom_boxplot() +
coord_flip() +
labs(y = "Review Sentiment", x = "Cuisine Type") +
ggtitle("Review Sentiment of Top 20 Cuisines")
We then took a closer look at the reviews themselves to see what insights we could gain from them. To do so, we reverted to a traditional method, term frequency, to see which words appear most often in certain reviews. For this purpose, we defined two categories of reviews: those with a sentiment score above 0.5 (“Good Reviews”) and those below -0.5 (“Bad Reviews”). We subsequently created a set of wordclouds to understand which words are most prevalent in each type of review, and hence potentially predictive of review type. The result can be seen below.
setwd("~/EDAV_project/Data")
good_revs <- read_csv("good_revs.csv")
bad_revs <-read_csv("bad_revs.csv")
set.seed(1234)
wordcloud(words = good_revs$word, freq = good_revs$freq, min.freq = 300,
max.words=100, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
wordcloud(words = bad_revs$word, freq = bad_revs$freq, min.freq = 5000,
max.words=100, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
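The good_revs.csv and bad_revs.csv files above are precomputed word-frequency tables. A minimal sketch of how such a table can be built from raw review text (toy input, and without the stopword removal a real pipeline would include):

```r
# toy reviews standing in for the real review text
texts <- c("great food and great service", "terrible food")
words <- unlist(strsplit(tolower(texts), "[^a-z]+"))
words <- words[words != ""]
freq_table <- sort(table(words), decreasing = TRUE)
word_freq <- data.frame(word = names(freq_table), freq = as.integer(freq_table))
word_freq  # "great" and "food" each appear twice; wordcloud() consumes word + freq
```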
As expected, good reviews include words such as “great”, “excellent”, and “awesome”, while bad reviews contain more negative words, such as “worst”, “terrible”, and “disappointed”. Although we expected a result along these lines, we did not predict such strong polarization between the two sets, especially given that “Bad Reviews” are extremely rare, as seen below:
ggplot(toronto, aes(text_1)) +
geom_histogram(aes(y = (..count..) / sum(..count..)), bins = 20, fill = "darkblue", color = "black") +
xlab("Average Review Sentiment") +
ylab("Frequency") +
ggtitle("Histogram of Review Sentiment")
#ggsave(filename = "fig22.1.png", width = 4, height = 4, dpi = 200)
Finally, we took a look at elite users. Yelp elite users, akin to Piazza super users for this class, are members who post reviews often, check in often, and answer questions frequently. We wanted to see whether elite users tend to disproportionately post either good or bad reviews compared to the average user. The plot below shows one such attempt, comparing the average review sentiment of reviews posted by elite users vs. non-elite users.
elite_sent <- toronto %>%
group_by(elite) %>%
summarize(mean = mean(text_1, na.rm = TRUE))
ggplot(elite_sent, aes(elite, mean)) +
geom_col(color = "darkblue", fill = "darkblue", width = 0.5) +
xlab("Elite") +
ylab("Average Review Sentiment") +
ggtitle("Average Sentiment Scores of Reviews by Non-Elite Users vs. Elite Users")
We can see there is no discernible difference in the average review sentiment of elite users compared to non-elite users. Given that elite reviews generally have the same sentiment as non-elite reviews, our next task was to figure out whether elite user reviews nevertheless have an effect on restaurant ratings. Specifically, we wanted to determine whether a good review by an elite user results in an increase in the sentiment of subsequent reviews by non-elite users. To do so, we created a case study, shown next.
Case Study: Pai Northern Thai Kitchen
To investigate a potential relationship between elite reviews and subsequent review sentiment more closely, we looked at the business with the most reviews in the Toronto restaurant dataset, Pai Northern Thai Kitchen in Toronto’s Entertainment District. We plotted the progression of the restaurant’s review sentiment over the time horizon provided in the dataset and highlighted the elite reviews, in order to see whether there was an effect. This can be seen in the graph below.
# read in dataset of all reviews for Toronto businesses, text_1 column is the sentiment value
# sentiment value calculated using python library
toronto = read_csv("~/EDAV_project/Data/sentiment_reviews_Toronto.csv")
# obtained number of reviews for each business id
most_rev <- data.frame(table(toronto$business_id))
# arranged business ids in decreasing order of number of reviews and converted business id column to character
most_rev <- most_rev[order(-most_rev$Freq),]
most_rev$Var1 <- as.character(most_rev$Var1)
# obtained business id with greatest number of reviews
var <- most_rev$Var1[1]
# obtain all reviews for business id with greatest number of reviews and order by date
reviews <- subset(toronto, business_id == var)
reviews <- reviews[order(as.Date(reviews$date, format="%Y-%m-%d")),]
# add year and month columns
reviews$year <- as.numeric(format(as.Date(reviews$date, format="%Y-%m-%d"),"%Y"))
reviews$month <- as.numeric(format(as.Date(reviews$date, format="%Y-%m-%d"),"%m"))
# recode the elite column as a 0/1 factor (assumes exactly two levels, with non-elite first)
reviews$elite <- as.factor(reviews$elite)
levels(reviews$elite) <- c(0,1)
# trim columns of review data frame
reviews <- reviews[c("business_id", "date", "stars", "elite", "text_1", "year", "month")]
reviews$date <- as.Date(reviews$date)
# Stationary Sentiment Progression over Time
elite <- subset(reviews, reviews$elite == 1)
ggplot(reviews, aes(date, text_1)) +
geom_line(color='grey') +
scale_x_date(date_labels = ("%b-%Y"),
limits = as.Date(c(reviews$date[1], reviews$date[nrow(reviews)]))) +
xlab("Date") +
ylab("Sentiment") +
ylim(-1,1) +
geom_point(data = elite, aes(date, text_1), color='darkblue', shape = 5, size = 0.5) +
ggtitle("Sentiment over Time (3.5 Year Period)")
We noticed that the data was too clustered, with too many elite reviews in a short time period, to get a good sense of a potential relationship. We addressed this by plotting the same sentiment over only the last six months instead of the full 3.5 years. The resulting graph is depicted below.
reviews_2017 <- subset(reviews, reviews$date > as.Date("2017-06-01"))
# Stationary Graph for Sentiment Progression June - December 2017
elite <- subset(reviews_2017, reviews_2017$elite == 1)
ggplot(reviews_2017, aes(date, text_1)) +
geom_line(color = 'grey') +
scale_x_date(date_labels = ("%b-%Y"),
limits = as.Date(c(reviews_2017$date[1], reviews_2017$date[nrow(reviews_2017)]))) +
xlab("Date") +
ylab("Sentiment") +
ylim(-1,1) +
geom_point(data = elite, aes(date, text_1), color='darkblue', shape = 5, size = 0.5) +
ggtitle("Sentiment Progression from June to December 2017")
We can make a few observations here:
- The sentiment of elite reviews tends to hover between 0.1 and 0.4 (note that it is positive the vast majority of the time).
- A surprisingly large number of elite reviews have a sentiment near (or equal to) either the best or the worst average review sentiment of the given day.
- There is a non-trivial number of instances in which sentiment rises after a poor elite review and drops after a good elite review, such as the large drop around November 2017.
- Further data and analysis would be required to identify potential causes of these patterns.